The recent success of vision transformers has inspired a series of vision backbones with novel feature transformation paradigms, which report steady performance gains. Although the novel feature transformation designs are often claimed as the source of the gain, some backbones may benefit from advanced engineering techniques, which makes it hard to attribute the real gain to the key feature transformation operators. In this paper, we aim to identify the real gain of popular convolution and attention operators and make an in-depth study of them. We observe that the main difference among these feature transformation modules, e.g., attention or convolution, lies in the way they aggregate spatial features, i.e., the so-called "spatial token mixer" (STM). Hence, we first elaborate a unified architecture to eliminate the unfair impact of different engineering techniques, and then fit STMs into this architecture for comparison. Based on various experiments on upstream/downstream tasks and an analysis of inductive bias, we find that the engineering techniques boost performance significantly, but a performance gap still exists among different STMs. The detailed analysis also reveals some interesting findings about the different STMs, such as their effective receptive fields and their behavior under invariance tests. The code and trained models will be publicly available at https://github.com/OpenGVLab/STM-Evaluation.
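To make the unified-architecture idea concrete, here is a minimal sketch (an illustration, not the authors' actual code) in which the spatial token mixer is the only pluggable part of an otherwise fixed block; all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class UnifiedBlock(nn.Module):
    """A generic block where only the spatial token mixer (STM) varies.
    Norms, channel MLP, and residual wiring are held fixed so that any
    performance difference can be attributed to the STM itself."""
    def __init__(self, dim: int, stm: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.stm = stm  # e.g. a depth-wise conv or a self-attention layer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):  # x: (B, N, C) token sequence
        x = x + self.stm(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

class AttnSTM(nn.Module):
    """One interchangeable STM choice: plain multi-head self-attention."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]

block = UnifiedBlock(dim=64, stm=AttnSTM(64))
out = block(torch.randn(2, 49, 64))  # -> (2, 49, 64)
```

Under such a setup, swapping `AttnSTM` for, say, a depth-wise convolution module changes only the mixer, so any accuracy difference is attributable to the STM rather than the surrounding engineering.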
In previous deep-learning-based methods, semantic segmentation has been regarded as a static or dynamic per-pixel classification task, i.e., classifying each pixel representation into a specific category. However, these methods focus only on learning better pixel representations or classification kernels while ignoring the structural information of objects, which is critical to the human decision-making mechanism. In this paper, we present a new paradigm for semantic segmentation, called structure-aware extraction. Specifically, it generates the segmentation results via the interactions between a set of learnable structure tokens and the image feature, aiming to progressively extract the structural information of each category from the feature. Extensive experiments show that our StructToken outperforms the state of the art on three widely used benchmarks, including ADE20K, Cityscapes, and COCO-Stuff-10K.
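The token-feature interaction can be pictured as a cross-attention between learnable class tokens and flattened image features; the sketch below is one plausible reading of that mechanism (the paper explores several interaction variants), with all names and shapes assumed.

```python
import torch
import torch.nn as nn

class StructureTokenHead(nn.Module):
    """Hypothetical sketch: K learnable structure tokens attend to image
    features, and the attention map itself serves as a per-class mask."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_classes, dim))
        self.scale = dim ** -0.5

    def forward(self, feat):           # feat: (B, H*W, C)
        q = self.tokens.unsqueeze(0)   # (1, K, C), broadcast over batch
        attn = (q @ feat.transpose(1, 2)) * self.scale  # (B, K, H*W)
        return attn                    # per-class logits over locations

head = StructureTokenHead(dim=256, num_classes=150)  # ADE20K has 150 classes
masks = head(torch.randn(2, 32 * 32, 256))           # -> (2, 150, 1024)
```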
Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computational complexity caused by global self-attention, various methods constrain the range of attention within a local region to improve efficiency. Consequently, their receptive fields in a single attention layer are not large enough, resulting in insufficient context modeling. To address this issue, we propose Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pale-shaped region. Compared with global self-attention, PS-Attention can significantly reduce computation and memory costs. Meanwhile, it can capture richer contextual information under computational complexity similar to that of previous local self-attention mechanisms. Based on PS-Attention, we develop a general Vision Transformer backbone with a hierarchical architecture, named Pale Transformer, which achieves 83.4%, 84.3%, and 84.9% Top-1 accuracy with model sizes of 22M, 48M, and 85M, respectively, on 224x224 ImageNet-1K classification, outperforming previous Vision Transformer backbones. For downstream tasks, our Pale Transformer backbone performs better than the recent state-of-the-art CSWin Transformer by a large margin on ADE20K semantic segmentation and COCO object detection and instance segmentation. The code will be released at https://github.com/BR-IDL/PaddleViT.
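Since the exact pale-shaped region is intricate (interlaced rows and columns), the sketch below substitutes a coarser axial-style pattern in which each token attends along its full row and full column; it conveys the restricted-region idea but is deliberately not PS-Attention itself, and it omits the usual query/key/value projections.

```python
import torch

def axial_attention(x):
    """Simplified stand-in for pale-shaped attention (an assumption; the
    real PS-Attention region is a sparser pale of interlaced rows/columns).
    x: (B, H, W, C) feature map."""
    B, H, W, C = x.shape
    scale = C ** -0.5
    # Row attention: tokens in the same row attend to each other.
    row = torch.softmax((x @ x.transpose(-2, -1)) * scale, dim=-1) @ x
    # Column attention: swap H and W, attend, swap back.
    xc = x.transpose(1, 2)                                   # (B, W, H, C)
    col = torch.softmax((xc @ xc.transpose(-2, -1)) * scale, dim=-1) @ xc
    return row + col.transpose(1, 2)                         # (B, H, W, C)

out = axial_attention(torch.randn(2, 14, 14, 64))  # -> (2, 14, 14, 64)
```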
Owing to their ability to model long-range dependencies, Transformers have shown impressive performance in various natural language processing and computer vision tasks. Recent progress has demonstrated that combining such Transformers with CNN-based semantic image segmentation models is very promising. However, how well a pure-Transformer approach can perform on image segmentation has not yet been well studied. In this work, we explore a novel framework for semantic image segmentation: the encoder-decoder-based Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid Group Transformer (PGT) as the encoder to progressively learn hierarchical features while reducing the computational complexity of the standard Vision Transformer (ViT). Then we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation. Surprisingly, this simple baseline achieves better results on multiple challenging semantic segmentation and face parsing benchmarks, including PASCAL Context, ADE20K, COCO-Stuff, and CelebAMask-HQ. The source code will be released at https://github.com/BR-IDL/PaddleViT.
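As a rough, hedged sketch of the decoder's role only: the snippet below fuses multi-level encoder features by projecting them to a common width, upsampling, and summing, whereas the actual FPT performs this fusion with transformer blocks; every name here is an assumption for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFeatureFusion(nn.Module):
    """Hypothetical stand-in for a feature-pyramid decoder: align each
    encoder level to one width and one resolution, then sum."""
    def __init__(self, in_dims, out_dim=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in in_dims)

    def forward(self, feats):  # coarse-to-fine list of (B, C_i, H_i, W_i)
        target = feats[-1].shape[-2:]  # finest spatial size
        return sum(F.interpolate(p(f), size=target, mode="bilinear",
                                 align_corners=False)
                   for p, f in zip(self.proj, feats))

fuse = SimpleFeatureFusion(in_dims=[512, 320, 128, 64])
feats = [torch.randn(1, c, s, s)
         for c, s in zip([512, 320, 128, 64], [7, 14, 28, 56])]
out = fuse(feats)  # -> (1, 256, 56, 56)
```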
Symptom information is primarily documented in free-text clinical notes and is not directly accessible for downstream applications. To address this challenge, information extraction methods are needed that can handle the variation in clinical language across institutions and specialties. In this paper, we present domain generalization for symptom extraction using pretraining and fine-tuning data that differ from the target domain in terms of institution and/or specialty and patient population. We extract symptom events using a Transformer-based joint entity and relation extraction method. To reduce reliance on domain-specific features, we propose a domain generalization method that dynamically masks frequent symptom words in the source domain. Additionally, we pretrain the Transformer language model (LM) on task-related unlabeled text for better representations. Our experiments show that the masking and adaptive pretraining methods significantly improve performance when the source domain is more distant from the target domain.
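The dynamic-masking idea is simple enough to sketch directly: replace the most frequent source-domain symptom words with a mask token so the extractor cannot lean on memorized lexical cues. The function below is a hypothetical illustration; the cutoff, probability, and names are all assumptions rather than the paper's settings.

```python
import random
from collections import Counter

def mask_frequent_symptom_words(tokens, symptom_counts, top_k=50,
                                mask_prob=0.5, mask_token="[MASK]"):
    """Randomly mask the top-k most frequent source-domain symptom words.
    tokens: list of word strings; symptom_counts: Counter over symptom
    words observed in the source domain (both hypothetical inputs)."""
    frequent = {w for w, _ in symptom_counts.most_common(top_k)}
    return [mask_token
            if t.lower() in frequent and random.random() < mask_prob
            else t for t in tokens]

counts = Counter({"cough": 120, "fever": 95, "nausea": 40})
sent = "Patient reports persistent cough and mild fever".split()
print(mask_frequent_symptom_words(sent, counts, top_k=2))
```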
A recent study has shown a phenomenon called neural collapse, in which the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to the discrimination of minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
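For reference, a K-class simplex equiangular tight frame consists of equal-norm vectors with pairwise cosine similarity exactly -1/(K-1). The construction below is standard in the neural-collapse literature; the regularizer is only a sketch of how feature centers might be pulled toward such a target, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def simplex_etf(num_classes: int, dim: int):
    """Build a K x dim simplex ETF: unit-norm rows, pairwise cosine
    -1/(K-1). Requires dim >= num_classes."""
    K = num_classes
    P = torch.linalg.qr(torch.randn(dim, K)).Q      # (dim, K), orthonormal columns
    M = (K / (K - 1)) ** 0.5 * P @ (torch.eye(K) - torch.ones(K, K) / K)
    return M.T                                       # (K, dim)

def center_regularizer(feature_centers, etf_target):
    """Sketch: pull normalized class feature centers toward ETF directions."""
    c = F.normalize(feature_centers, dim=1)
    t = F.normalize(etf_target, dim=1)
    return ((c - t) ** 2).sum(dim=1).mean()

etf = simplex_etf(num_classes=20, dim=256)
loss = center_regularizer(torch.randn(20, 256), etf)
```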
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only image-level labels. Most existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet they ignore the co-occurrence confounder of object and context (e.g., fish and water), which makes it hard for the model to distinguish object boundaries. Besides, the use of CAM also brings a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracy simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and in resolving the dilemma between classification and localization performance.
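For context, the CAM that this method builds on is simply a classifier-weighted sum over the last convolutional feature maps; the sketch below shows that standard computation (Zhou et al., 2016), not the paper's causal-intervention pipeline.

```python
import torch

def class_activation_map(feature_map, fc_weight, class_idx):
    """Standard CAM: weight the last conv features by the classifier
    weights of one class, then ReLU and normalize to [0, 1].
    feature_map: (C, H, W); fc_weight: (num_classes, C)."""
    w = fc_weight[class_idx]                         # (C,)
    cam = torch.einsum("c,chw->hw", w, feature_map)  # weighted channel sum
    cam = torch.relu(cam)
    return cam / (cam.max() + 1e-8)

cam = class_activation_map(torch.randn(512, 14, 14),
                           torch.randn(1000, 512), class_idx=3)
```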
Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem of visuomotor driving. Given the highly dynamic and variable nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, which renders predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns the driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, with improvements ranging from 2% to over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
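The photometric error in such self-supervised geometry pipelines is typically an SSIM/L1 mixture between a warped source frame and the target frame; the sketch below shows that common form as an assumption about PPGeo's objective, not its verified implementation (the warping step via predicted depth and pose is omitted).

```python
import torch
import torch.nn.functional as F

def photometric_loss(pred, target, alpha=0.85):
    """SSIM + L1 photometric error, the usual loss in self-supervised
    depth/ego-motion training. pred, target: (B, 3, H, W) in [0, 1]."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    # 3x3 average pooling serves as the local window for SSIM statistics.
    mu_p = F.avg_pool2d(pred, 3, 1, 1)
    mu_t = F.avg_pool2d(target, 3, 1, 1)
    var_p = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_p ** 2
    var_t = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, 3, 1, 1) - mu_p * mu_t
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    ssim_loss = ((1 - ssim) / 2).clamp(0, 1).mean(1, keepdim=True)
    return (alpha * ssim_loss + (1 - alpha) * l1).mean()

loss = photometric_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```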
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
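One plausible form of the grounding loss is a symmetric contrastive alignment between predicted object (mask) embeddings and caption-noun embeddings; the sketch below is an assumption in that spirit, not the CGG paper's exact loss.

```python
import torch
import torch.nn.functional as F

def grounding_loss(obj_embeds, noun_embeds, temperature=0.07):
    """Symmetric InfoNCE over matched (object, noun) pairs: pair i should
    be more similar to each other than to any other row/column.
    obj_embeds, noun_embeds: (N, D), matched at the same index."""
    o = F.normalize(obj_embeds, dim=1)
    n = F.normalize(noun_embeds, dim=1)
    logits = o @ n.t() / temperature        # (N, N) similarity matrix
    labels = torch.arange(len(o))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

loss = grounding_loss(torch.randn(8, 256), torch.randn(8, 256))
```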
Nearest-Neighbor (NN) classification has been proven a simple and effective approach for few-shot learning. The query data can be classified efficiently by finding the nearest support class based on features extracted by pretrained deep models. However, NN-based methods are sensitive to the data distribution and may produce false predictions if the samples in the support set happen to lie around the distribution boundary of different classes. To solve this issue, we present P3DC-Shot, an improved nearest-neighbor based few-shot classification method empowered by prior-driven data calibration. Inspired by the distribution calibration technique, which utilizes the distribution or statistics of the base classes to calibrate the data for few-shot tasks, we propose a novel discrete data calibration operation that is more suitable for NN-based few-shot classification. Specifically, we treat the prototypes representing each base class as priors and calibrate each support sample based on its similarity to the different base prototypes. Then, we perform NN classification using the discretely calibrated support data. Results from extensive experiments on various datasets show that our efficient non-learning-based method can outperform or at least be comparable to SOTA methods that require additional learning steps.
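The calibration step can be sketched as a convex combination of each support feature with a similarity-weighted mixture of base prototypes; the snippet below is one such reading, with `alpha` and `tau` as hypothetical knobs rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def calibrate_support(support, base_prototypes, alpha=0.7, tau=10.0):
    """Sketch of prior-driven calibration in the spirit of P3DC-Shot:
    shift each support feature toward base-class prototypes, weighted by
    cosine similarity. support: (S, D); base_prototypes: (K, D)."""
    s = F.normalize(support, dim=1)
    p = F.normalize(base_prototypes, dim=1)
    weights = torch.softmax(s @ p.t() * tau, dim=1)  # (S, K) similarities
    prior = weights @ base_prototypes                # (S, D) prototype mixture
    return alpha * support + (1 - alpha) * prior

calibrated = calibrate_support(torch.randn(5, 640), torch.randn(64, 640))
```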